Improving an NP-Chunker by Exploiting the Inner NP Dependency Structure

Mihaela Colhon
University of Craiova, Department of Computer Science, 13 Alexandru Ioan Cuza St
mcolhon@inf.ucv.ro

Dan Cristea
“Alexandru Ioan Cuza” University of Iaşi, Faculty of Computer Science, 16 Berthelot St, Iaşi
Institute of Computer Science, the Iaşi branch of the Romanian Academy, 2 Codrescu St
dcristea@info.uaic.ro

ABSTRACT
In this article we extend our previous work dedicated to developing an automatic method for generating and applying syntactic patterns intended to recognize noun phrases. The intended scope of our studies is to automatically detect dependency relations within noun phrases by exploiting some syntactic information of the phrases' words. The patterns are extracted from a corpus that has been automatically annotated for token and sentence limits, part of speech and noun phrases, and manually annotated for dependency relations between words. The patterns have been generalized in order to cover more instances than were present in the corpus and to reduce the size of the grammar. In this paper we refine the syntactic patterns by considering not only the two terms of the dependency relation but also the terms surrounding these two, which are treated as contexts: left, middle or right, depending on their position relative to the dependency terms.

Author Keywords
Natural Language Processing; Dependency Parsing; HCI

ACM Classification Keywords
H.5.2 User Interfaces (e.g., HCI): Natural language

General Terms
Human Factors; Design; Measurement

INTRODUCTION
Knowledge expressed in natural language, or modalities for communicating in a human-to-human style, can be incorporated into user interfaces by implementing Natural Language Processing tasks. In the context of Human Computer Interaction (HCI for short), Natural Language Processing (NLP for short) issues range from various speech recognition systems, to natural language interfaces for various applications, to a multitude of machine translation systems.

Natural language processing applications, such as information retrieval, automatic translation, sentiment analysis or applications that have to answer questions automatically, require both semantic analysis and morpho-syntactic analysis of texts at different levels.

Particularly interesting from the HCI perspective is the research conducted in the domain of linguistics and the work put into constructing treebanks, both monolingual and parallel.

To illustrate the need for reliable natural language processing tools, let us consider the following question-answer example:

the information: "[...] Prices of sunflower oil rose by 27% due to low production [...]"

and the question: "For which food has the price increased by 27%?"

The answer "sunflower oil" lies within the nominal group "Prices of sunflower oil", so the correct identification of the nominal group, as well as the disambiguation of its structure, is absolutely necessary in order to provide the correct answer. If the parser cannot recover the structure of the nominal group, then the correct answer candidate cannot be found, even if the phrase segment containing one or more nominal groups has been correctly identified. This problem also affects anaphora resolution systems, automatic syntactic analysis, and automatic translation using syntactic trees.

Automatic syntactic parsing of a text is used either on its own or within linguistic processing chains, in order to ease their purpose and to enhance precision, for tasks such as: coreference chains, entity recognition, nominal group processing (the study in which the present work is included), or automated translation of a text. Syntactic parsing is the level that follows morphological labelling, the latter being at the base of any processing chain.

Two of the most used formalisms for describing syntactic structures are hierarchies of constituents and dependency relations. Contiguous sequences of words are grouped under non-terminal symbols (as constituents), as part of context-free grammars, while dependency relations are asymmetrical functional relations between pairs of words, considered head and modifier.

While these two traditions have sometimes been presented as competing with each other, a straightforward correspondence seems to exist between a projective dependency analysis (in which there are no crossing links) and a constituent structure analysis.

Reliable dependency parsing is a notoriously difficult problem in Natural Language Processing. We describe in this paper a pattern-based approach to dependency parsing that addresses only Nominal Phrases (NPs).

To build a dependency treebank, the human experts must decide, for each word, which word it depends on. However, rather often there is no consensus among annotators on what the correct dependency structure for a particular sentence should be, because the decision regarding dependencies involves a deep interpretation process.

In contrast, in building phrase structures, human annotators are confronted with much less ambiguity. This is because only a sequence of constituents has to be indicated in the right-hand side of a grammar rule, and not also their relative roles/functions with respect to the parent constituent, as indicated by the left symbol. It is therefore normal that the effort of establishing correct dependency structures pays off, and the difference lies in the much closer resemblance of a dependency structure to a semantic interpretation than in the case of a constituent structure. This aspect is even more important for languages that have a free word order (for instance Czech, Romanian, etc.), where dependency treebanks are preferred to constituent structure representations (see, for instance, the Prague Dependency Treebank [8]).

Using treebank data for training and evaluating parsing systems is known as treebank parsing, a methodology that has been used to construct robust and efficient parsers for several languages over the last ten years. For this kind of parsing, the treebank data is used to train the parser but also to evaluate the quality of the resulting parser with respect to accuracy as well as efficiency. Călăcean and Nivre [3] report results on a MaltParser-based dependency parser for Romanian, trained on a manually annotated Romanian Treebank¹ built in the RORIC-LING project. Their precision for recognizing labelled relations is between 60.8% and 95.9%, depending on the length of the link, while the recall is in the range 71.3% - 96.3%. Their corpus includes only short sentences (with an average of 8.94 tokens per sentence) and a gold standard part-of-speech annotation.

¹ http://www.phobos.ro/roric/texts/xml/

In this paper we present ongoing research, started in 2014, on a treebank dependency parsing mechanism that is restricted to NP chunks. The presented study refines the syntactic patterns defined in [5] by considering not only the two terms of the dependency relation but also the terms surrounding these two, which are treated as contexts: left, middle or right, depending on their position relative to the dependency terms. We work on Romanian, but the method is general enough to be applicable to other languages as well.

The paper is organized as follows: in Section 2 we present the Treebank corpus we use for our study. Section 3 describes the manner in which the collection of patterns, extracted from an automatic annotation of NP chunks over the Treebank, is organized. Section 4 describes the method and the results, while Section 5 formulates a number of conclusions.

THE TREEBANK
A dependency Treebank for the Romanian language was built, and we have used this resource in the training and evaluation stages of our proposed mechanism.

The corpus contains Romanian texts selected from a wide range of genres/registers of language². Three levels of annotation have been added to the raw text and encoded in XML, by adopting a simplified form of the XCES standard [7]:

• Level-1: segmentation and lexical information. Sentences have their boundaries manually marked and each token has attached its part of speech, lemma, and morpho-syntactic information, by running an automatic processing chain that includes tokenisation, POS-tagging and lemmatisation;
• Level-2: noun phrases. By exploiting Level-1 information, an NP-chunker adds information regarding noun phrase boundaries and their head words;
• Level-3: syntactic dependency data. During a manual annotation phase, each token of all the sentences has been complemented with its head word and the dependency relation towards the head.

The Treebank thus acquired contains 2,630 sentences, in which 7,674 NP structures were identified by the NP chunker.

² The corpus mainly includes Romanian translations of the first chapter of George Orwell's novel "1984", Romanian parts from the JRC-Acquis corpus, Romanian Wikipedia texts, grammar texts used in high schools, etc.

THE DATABASE OF NOUN PHRASES PATTERNS
The primary data extracted from the corpus is used to associate dependency structures with each sequence of MSD³ tags corresponding to NPs extracted from the corpus, which we will call morphological structures in the following.

³ Morpho-Syntactic Description, the notation used in the Multext projects.

The corpus includes only contiguous NPs. As already said, the corpus used in this study exhibits three levels of annotation: POS tags, NP chunks and dependency relations. For each NP chunk, we extracted the morphological structure and the configuration of dependency relations manually marked among the words of the NP chunk.

To the syntactic constituents of the sentence correspond dependency structures organized in (sub)trees. Each node of such a (sub)tree is a word, connected by a dependency relation to its head word. The roots of these (sub)trees are the only elements related to words outside the constituents, except for the root of the sentence, which is not related to another word. In the same way, the head word of an NP is the only word belonging to the NP that is related outside the NP.

For example, if we take the following bracket representation for the Romanian noun phrase „lamele de ras tocite” (En: "the blunt razor blades"):

    [NP [Ncfpry lamele] [Spsa de] [Ncfsry ras] [Afpfp-n tocite]]

its morphological structure is:

    Ncfpry Spsa Ncfsry Afpfp-n⁴

and its internal dependencies are:

    a.subst.(lamele-1,de-2)
    prep.(de-2,ras-3)
    a.adj.(lamele-1,tocite-4)

⁴ See the Appendix for the meaning of the MSD tags in this paper.

In the notations above, the name of the relation is placed in front of a pair of words, the first one being the head and the second the modifier word. The number attached to a word is its position in the NP.

A database table is built out of the 3 layers of notations in the corpus. Each record in this table is a triplet whose components are: a morphological structure (a pattern of MSD tags); the sequence of dependency relations (transferred from the words to the MSD representations on the respective positions, the first position being counted as 1); and the sequence of head positions in the pattern⁵. The 0-rel always marks, in the relation sequence, the relation of the head word of the NP going out of the NP, and 0 denotes, in the head sequence, the exterior head of the head word of the NP.

⁵ Of course, the property of unique occurrence of a record in the database is observed here as well.

The head of an NP chunk is usually attached, in the syntactic tree of the sentence, to another word which, of course, is not part of that NP sequence. For example, the NP “lamele de ras tocite” has a corresponding entry in the database in which the 0-rel appears once, on the position of the head noun “lamele”.

Based on the dependency data coded in the corpus, the entries of the database are grouped in three categories:

• with no external heads: no 0-rel in the rels field;
• with one external head: exactly one 0-rel in the rels field;
• with more than one external head: at least two 0-rel in the rels field.

Using these representations one can easily detect the incorrect NP sequences marked in the corpus, but also the MSD sequences that are ambiguous with respect to their dependency structures.

Indeed, exploiting this grouping, the NP chunks incorrectly marked in the corpus during the automatic NP-chunking emerge immediately, as they correspond to entries which include more than one 0-rel in the relation sequence.

In contrast, MSD sequences with one single external head represent correct NP chunks that are connected with the other words of the sentences they belong to by means of their heads.

The entries in the database with no 0-rel are also correct MSD sequences of NP chunks. These cases correspond to verb-elliptical sentences in which the NP is the very root of the sentence (the main verb is missing).

The ambiguous dependency constructions emerge immediately as the sets of entries in the database that have the same MSD sequence.

APPLYING PATTERNS
Our method for generating and applying corpus-based patterns works as follows (see Figure 1): starting from flat NP sequences and using a set of syntactic patterns extracted from the training corpus, we identify dependency links between the words of the NPs, based on their MSD fingerprint. By this, the flat NP sequences become dependency sub-trees. In order to do that, the dependency relations, considered independent of one another, are decoupled from the particular morphological structures they occur in. We then try to generalize the set of contexts of each relation. The aim of the generalization is to reduce the number of dependency relation patterns extracted from the corpus, but also to infer deducible sequences not instantiated in the corpus.

During the pattern-generalization process, the following steps are repeated for all records of the database in which a relation R occurs between identical tags: for each MSD sequence in the database, three types of contexts are marked:

• left context: the sequence of MSD tags appearing to the left of the position of the first element involved in the dependency R;
• middle context: the sequence of MSD tags appearing in between the two tags involved in the targeted dependency relation;
• right context: the sequence of MSD tags appearing to the right of the second element involved in the targeted dependency relation R.

For each dependency relation, the dependents are represented by two MSD tags, indexed 1 and 2:

• 1: the MSD tag for the head
• 2: the modifier MSD tag

The contexts (if they are present in the pattern structure) may be optional or mandatory: the optional ones are marked with ? and the mandatory ones with {1}.

[Figure 1. Dependency Parsing Mechanism within Noun Phrases]

Let us now take three examples of NPs:

NP#1: “Fetele acestea frumoase” (En: “These beautiful girls”)
NP#2: “Mâinile lui curate” (En: “His clean hands”)
NP#3: “Lumânările aprinse” (En: “candles lit”)

In all these NPs the same relation (adjectival attribute, noted with “a.adj.”) is found between words displaying identical MSD tags (that is, Ncfpry as head and Afpfp-n as modifier). Three patterns for this dependency relation are found: from NP#1, a pattern whose middle context is the determiner tag Dd3fpr-; from NP#2, a pattern whose middle context is the pronoun tag Pp3mso-; from NP#3, a pattern with an empty middle context.

After the generalization process, all these representations are merged into a single one with an optional middle context. The middle context in the generalized pattern is optional because it can either contain one of the tags separated by the “or” operator (“|”) or be empty, as in the last example.

After generalization, the patterns extracted from the corpus were grouped into 24 sets, each corresponding to one dependency relation and covering the span of an entire NP. The evaluation scores obtained are given in the next section.

EVALUATION
The Treebank sentences were split into a training set and a testing set. The parser was trained on approximately 90% of the Treebank and evaluated on the remaining 10% using a 10-fold cross-validation policy, which guaranteed no intersection between training and evaluation sentences.

For the testing set, the most frequent relations identified by the parser were:
- the adjectival attribute (a.adj.),
- nominal attribute (a.subst.),
- determiner (det.) and
- preposition (prep.)

No projectivity restrictions on the generated dependency structures have been included at this moment. The obtained scores are given in Table 1. As one can observe, the best accuracy was obtained for the determiner relation, while the lowest was for the adjectival attribute relation. Nevertheless, the evaluation scores give hope for continuing our pattern-based approach in Romanian dependency parsing.

                                   Accuracy
    Dependency relations    Precision   Recall   F-measure
    ALL                     0.63        0.74     0.68
    a.adj.                  0.54        0.86     0.66
    a.subst.                0.67        0.72     0.69
    det.                    0.90        0.82     0.86
    prep.                   0.74        0.66     0.70
    Unlabelled relations    0.72        0.78     0.75

Table 1: Evaluation scores

CONCLUSION
It is well known that there is a difficult trade-off in designing proper generalization patterns, because making them too lax could trigger false instances. On the other hand, making them too strict will imply low recall scores. There is perhaps more to be done in this direction.

We are aware that the model would gain in precision if lexical information were included, by enriching the MSD tags of the generated patterns with lemmas. In our experiments so far we have excluded lexical information, because it presupposes a much larger training corpus, which was not available to us during this phase of research.

REFERENCES
1. Michael Brody. 1994. Phrase structure and dependence. Working Papers in the Theory of Grammar 1(1), Theoretical Linguistics Programme, Budapest Univ.
2. Marian Cristian Mihăescu, Mihaela Gabriela Țacu, Dumitru Dan Burdescu. 2014. Use Case of Cognitive and HCI Analysis for an E-Learning Tool. Informatica 38: 273-279.
3. Mihaela Călăcean, Joakim Nivre. 2009. A Data-Driven Dependency Parser for Romanian. Proceedings of TLT-7.
4. Alexandra Cristina Cristea. 2014. Syntactic Study on Nominal Groups in Romanian (in Romanian). Dissertation Thesis, “Alexandru Ioan Cuza” University of Iaşi.
5. Mihaela Colhon, Dan Cristea. 2014. Automatic Extraction of Syntactic Patterns for Dependency Parsing in Noun Phrase Chunks. Bucharest Working Papers in Linguistics, vol. XVI, nr. 1, ISSN 1454-9328.
6. Tomaž Erjavec. 2010. Multext-east version 4: Multilingual morphosyntactic specifications, lexicons and corpora. In LREC.
7. Nancy Ide, P. Bonhomme, L. Romary. 2000. Xces: An xml-based encoding standard for linguistic corpora. In Proceedings of the Second International Language Resources and Evaluation Conference. Paris: European Language Resources Association.
8. Jan Hajič, Alena Böhmová, Eva Hajičová, Barbora Vidová-Hladká. 2000. The prague dependency treebank: A three-level annotation scenario. A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, Amsterdam: Kluwer: 103-127.
9. Florentina Hristea, Marius Popescu (eds.). 2003. Building Awareness in Language Technology. University of Bucharest Publishing House.
10. Svetoslav Marinov, Joakim Nivre. 2005. A Data-Driven Dependency Parser for Bulgarian. In Proceedings of TLT 2005: 89-100.
11. Igor Mel’čuk. 1987. Dependency Theory: Syntax and Practice. Albany, NY: SUNY Press.
12. Joakim Nivre, Johan Hall, Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. Proceedings of LREC-2006: 2216-2219.
13. Cenel A. Perez. 2012. Casuistry of Romanian functional dependency grammar. M. A. Moruz, D. Cristea, D. Tufis, A. Iftene, H. N. Teodorescu (eds.), Proceedings of the 8th International Conference “Linguistic Resources And Tools For Processing Of The Romanian Language”: 19-28.
14. Radu Simionescu. 2012. Graphical grammar studio as a constraint grammar solution for part of speech tagging. M. A. Moruz, D. Cristea, D. Tufis, A. Iftene, H. N. Teodorescu (eds.), Proceedings of the 8th International Conference “Linguistic Resources And Tools For Processing Of The Romanian Language”: 109-118.
15. Radu Simionescu. 2012. Romanian deep noun phrase chunking using graphical grammar studio. M. A. Moruz, D. Cristea, D. Tufis, A. Iftene, H. N. Teodorescu (eds.), Proceedings of the 8th International Conference “Linguistic Resources And Tools For Processing Of The Romanian Language”: 135-143.
16. David Vadas, James R. Curran. 2011. Parsing Noun Phrases in the Penn Treebank. School of Information Technologies, University of Sydney, Australia.

APPENDIX
The following table gives the morpho-syntactic descriptions (MSD for short) used in this paper.

    MSD tag    The meaning of the notation (according to MULTEXT-East lexical specifications)
    Afpfp-n    Adjective qualifier positive feminine plural -definiteness
    Dd3fpr-    Determiner demonstrative third feminine plural direct
    Ncfpry     Noun common feminine plural direct +definiteness
    Ncfsry     Noun common feminine singular direct +definiteness
    Pp3mso-    Pronoun personal third masculine singular oblique
    Spsa       Adposition preposition simple accusative
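
Using the MSD tags listed in this appendix, two of the steps described in the sections on the database of noun phrase patterns and on applying patterns can be illustrated with a minimal Python sketch: grouping a database entry by the number of 0-rel relations it holds, and merging the middle contexts observed for one relation into a single generalized pattern. This is only an illustration under stated assumptions, not the implementation used in the paper. The record layout (`msd`, `rels`, `heads`), the function names, the choice of ordering the two related tags by position, and the printed pattern syntax are all hypothetical; only the tags, the 0-rel convention and the `?`, `{1}` and `|` markers come from the text.

```python
# Hypothetical record for "lamele de ras tocite": MSD tags, relations and
# head positions per word (position 1 is the first word, 0 = external head).
LAMELE = {
    "msd":   ["Ncfpry", "Spsa", "Ncfsry", "Afpfp-n"],
    "rels":  ["0-rel", "a.subst.", "prep.", "a.adj."],
    "heads": [0, 1, 2, 1],
}


def external_head_category(entry):
    """Group an entry by how many 0-rel relations its relation sequence holds."""
    n = entry["rels"].count("0-rel")
    if n == 0:
        return "no external head"
    if n == 1:
        return "one external head"
    return "more than one external head"


def contexts(msd, pos_a, pos_b):
    """Return (left, middle, right) contexts of a relation between two
    1-based positions; the two elements are ordered by position (assumption)."""
    first, second = sorted((pos_a, pos_b))
    return msd[:first - 1], msd[first:second - 1], msd[second:]


def generalize_middle(head_tag, modifier_tag, middles):
    """Merge the observed middle contexts of one relation into one pattern.
    Alternatives are joined with '|'; the group is optional ('?') if an
    empty middle context was observed, mandatory ('{1}') otherwise."""
    alternatives = sorted({" ".join(m) for m in middles if m})
    if not alternatives:
        return f"{head_tag} {modifier_tag}"
    marker = "?" if any(not m for m in middles) else "{1}"
    return f"{head_tag} ({'|'.join(alternatives)}){marker} {modifier_tag}"


# The three example NPs for the a.adj. relation (Ncfpry head, Afpfp-n modifier).
examples = [
    ["Ncfpry", "Dd3fpr-", "Afpfp-n"],  # NP#1: Fetele acestea frumoase
    ["Ncfpry", "Pp3mso-", "Afpfp-n"],  # NP#2: Mâinile lui curate
    ["Ncfpry", "Afpfp-n"],             # NP#3: Lumânările aprinse
]
middles = [contexts(msd, 1, len(msd))[1] for msd in examples]
pattern = generalize_middle("Ncfpry", "Afpfp-n", middles)

print(external_head_category(LAMELE))
print(pattern)
```

Under these assumptions the sketch reports `one external head` for the example entry and prints the merged pattern `Ncfpry (Dd3fpr-|Pp3mso-)? Afpfp-n`, where the middle group is optional because NP#3 has no word between the noun and the adjective.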